Portfolio 1 - Cycling Data

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
from datetime import timedelta
from pandas.plotting import scatter_matrix
%matplotlib inline

Analysis of Cycling Data

Loading Data

The first dataset is an export of my ride data from Strava, an online social network site for cycling and other sports. This data is a log of every ride since the start of 2018 and contains summary data like the distance and average speed. It was exported using the script stravaget.py which uses the stravalib module to read data. Some details of the fields exported by that script can be seen in the documentation for stravalib.

The exported data is a CSV file, so it's easy to read; however, the dates in the file are recorded in a different timezone (UTC), so we need to do a bit of conversion. In reading the data I'm setting the index of the data frame to be the datetime of the ride.

In [2]:
strava = pd.read_csv('data/strava_export.csv', index_col='date', parse_dates=True)
strava.index = strava.index.tz_convert('Australia/Sydney')
strava.head()
Out[2]:
average_heartrate average_temp average_watts device_watts distance elapsed_time elevation_gain kudos moving_time workout_type
date
2018-01-03 07:47:51+11:00 100.6 21.0 73.8 False 15.2 94 316.00 m 10 73 Ride
2018-01-04 12:36:53+11:00 NaN 24.0 131.7 False 18.0 52 236.00 m 5 46 Ride
2018-01-04 13:56:00+11:00 83.1 25.0 13.8 False 0.0 3 0.00 m 2 2 Ride
2018-01-04 16:37:04+11:00 110.1 24.0 113.6 False 22.9 77 246.00 m 8 64 Ride
2018-01-06 06:22:46+11:00 110.9 20.0 147.7 True 58.4 189 676.00 m 12 144 Ride

The second dataset comes from an application called GoldenCheetah which provides some analytics services over ride data. This has some of the same fields but adds a lot of analysis of the power, speed and heart rate data in each ride. This data overlaps with the Strava data but doesn't include all of the same rides.

Again we create an index using the datetime for each ride, this time combining two columns in the data (date and time) and localising to Sydney so that the times match those for the Strava data.

In [3]:
cheetah = pd.read_csv('data/cheetah.csv', skipinitialspace=True)
cheetah.index = pd.to_datetime(cheetah['date'] + ' ' + cheetah['time'])
cheetah.index = cheetah.index.tz_localize('Australia/Sydney')
cheetah.head()
Out[3]:
date time filename axPower aPower Relative Intensity aBikeScore Skiba aVI aPower Response Index aIsoPower aIF ... Rest AVNN Rest SDNN Rest rMSSD Rest PNN50 Rest LF Rest HF HRV Recovery Points NP IF TSS
2018-01-28 06:39:49+11:00 01/28/18 06:39:49 2018_01_28_06_39_49.json 202.211 0.75452 16.6520 1.31920 1.67755 223.621 0.83441 ... 0 0 0 0 0 0 0 222.856 0.83155 20.2257
2018-01-28 07:01:32+11:00 01/28/18 07:01:32 2018_01_28_07_01_32.json 226.039 0.84343 80.2669 1.21137 1.54250 246.185 0.91860 ... 0 0 0 0 0 0 0 245.365 0.91554 94.5787
2018-02-01 08:13:34+11:00 02/01/18 08:13:34 2018_02_01_08_13_34.json 0.000 0.00000 0.0000 0.00000 0.00000 0.000 0.00000 ... 0 0 0 0 0 0 0 0.000 0.00000 0.0000
2018-02-06 08:06:42+11:00 02/06/18 08:06:42 2018_02_06_08_06_42.json 221.672 0.82714 78.8866 1.35775 1.86002 254.409 0.94929 ... 0 0 0 0 0 0 0 253.702 0.94665 98.3269
2018-02-07 17:59:05+11:00 02/07/18 17:59:05 2018_02_07_17_59_05.json 218.211 0.81422 159.4590 1.47188 1.74658 233.780 0.87231 ... 0 0 0 0 0 0 0 232.644 0.86808 171.0780

5 rows × 362 columns

The GoldenCheetah data contains many variables (columns) and I won't go into all of them here.

Below are definitions of some of the more important fields in the data. Capitalised fields come from the GoldenCheetah data while lowercase_fields come from Strava. Some fields are duplicated across the two sources; in those cases the values should be the same, although there is room for variation since each tool may use a different algorithm to calculate them.

  • Duration - overall duration of the ride, should be same as elapsed_time
  • Time Moving - time spent moving (not resting or waiting at lights), should be the same as moving_time
  • Elevation Gain - metres climbed over the ride
  • Average Speed - over the ride
  • Average Power - average power in watts as measured by a power meter, relates to how much effort is being put in to the ride, should be the same as average_watts from Strava
  • Nonzero Average Power - same as Average Power but excludes times when power is zero from the average
  • Average Heart Rate - should be the same as average_heartrate
  • Average Cadence - cadence is the rotations per minute of the pedals
  • Average Temp - temperature in the environment as measured by the bike computer (should be same as average_temp)
  • VAM - average ascent speed - speed up hills
  • Calories (HR) - Calorie expenditure as estimated from heart rate data
  • 1 sec Peak Power - this and other 'Peak Power' measures give the maximum power output in the ride over this time period. Will be higher for shorter periods. High values in short periods would come from a very 'punchy' ride with sprints for example.
  • 1 min Peak Hr - a similar measure relating to Heart Rate
  • NP - Normalised Power, a smoothed average power measurement, generally higher than Average Power
  • TSS - Training Stress Score, a measure of how hard a ride this was
  • device_watts - True if the power (watts) measures were from a power meter, False if they were estimated
  • distance - distance travelled in Km
  • kudos - likes from other Strava users (social network)
  • workout_type - one of 'Race', 'Workout' or 'Ride'

Some of the GoldenCheetah parameters are defined in their documentation.
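Where a field is duplicated across the two sources, a quick sanity check is to compare the two columns after the join. A minimal sketch with made-up values (the real comparison would use the merged frame built below):

```python
import pandas as pd

# Made-up sample rows; the real check would use the merged frame below.
merged = pd.DataFrame({
    'average_watts': [153.4, 187.7, 165.9],   # Strava
    'Average Power': [153.0, 188.1, 165.9],   # GoldenCheetah
})

# Mean absolute difference between the two sources' estimates of the same
# quantity; a small value suggests the duplicated fields agree.
diff = (merged['average_watts'] - merged['Average Power']).abs().mean()
print(round(diff, 3))  # → 0.267
```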

Your Tasks

Your first task is to combine these two data frames using the join method of Pandas. The goal is to keep only those rows of data that appear in both data frames so that we have complete data for every row.

In [4]:
table_merge=pd.merge(cheetah, strava, left_index=True, right_index=True, how='inner')
table_merge.head()
Out[4]:
date time filename axPower aPower Relative Intensity aBikeScore Skiba aVI aPower Response Index aIsoPower aIF ... average_heartrate average_temp average_watts device_watts distance elapsed_time elevation_gain kudos moving_time workout_type
2018-01-28 06:39:49+11:00 01/28/18 06:39:49 2018_01_28_06_39_49.json 202.211 0.75452 16.6520 1.31920 1.67755 223.621 0.83441 ... 120.6 21.0 153.4 True 7.6 17 95.00 m 4 17 Ride
2018-01-28 07:01:32+11:00 01/28/18 07:01:32 2018_01_28_07_01_32.json 226.039 0.84343 80.2669 1.21137 1.54250 246.185 0.91860 ... 146.9 22.0 187.7 True 38.6 67 449.00 m 19 67 Race
2018-02-01 08:13:34+11:00 02/01/18 08:13:34 2018_02_01_08_13_34.json 0.000 0.00000 0.0000 0.00000 0.00000 0.000 0.00000 ... 109.8 19.0 143.0 False 26.3 649 612.00 m 6 113 Ride
2018-02-06 08:06:42+11:00 02/06/18 08:06:42 2018_02_06_08_06_42.json 221.672 0.82714 78.8866 1.35775 1.86002 254.409 0.94929 ... 119.3 19.0 165.9 True 24.3 69 439.00 m 6 65 Ride
2018-02-07 17:59:05+11:00 02/07/18 17:59:05 2018_02_07_17_59_05.json 218.211 0.81422 159.4590 1.47188 1.74658 233.780 0.87231 ... 124.8 20.0 151.0 True 47.1 144 890.00 m 10 134 Ride

5 rows × 372 columns
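The task mentions the join method; an index-on-index pd.merge with how='inner', as used above, is equivalent to DataFrame.join with how='inner'. A toy sketch of the equivalence:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3]}, index=['r1', 'r2', 'r3'])
right = pd.DataFrame({'b': [10, 30]}, index=['r1', 'r3'])

# Index-on-index inner join two ways: both keep only the rows whose
# index appears in both frames.
joined = left.join(right, how='inner')
merged = pd.merge(left, right, left_index=True, right_index=True, how='inner')

print(joined.equals(merged))  # → True
```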

REQUIRED ANALYSIS

  1. Remove rides with no measured power (where device_watts is False) - these are commutes or MTB rides
In [5]:
not_MTB = table_merge[table_merge.device_watts != False].copy() # remove rows where device_watts is False; .copy() avoids SettingWithCopyWarning

not_MTB['Time Moving'] = not_MTB['Time Moving'] / 60 # convert Time Moving to minutes
not_MTB['Duration'] = not_MTB['Duration'] / 60 # convert Duration to minutes

# create new DataFrame for data analysis
new_MTB = not_MTB[['Average Heart Rate', 'Average Speed', 'Average Power', 'distance', 'Elevation Gain', 'Time Moving', 'TSS', 'Duration', 'workout_type']]
new_MTB = new_MTB.sort_values(['distance', 'TSS'])
new_MTB.columns = ['Average Heart Rate', 'Average Speed', 'Average Power', 'Distance', 'Elevation Gain', 'Time Moving', 'TSS', 'Duration', 'Workout Type']
  2. Look at the distributions of some key variables: time, distance, average speed, average power, TSS. Are they normally distributed? Skewed?
    • Average Power and Average Speed have left-skewed (negatively skewed) distributions.
    • Distance, Duration, and TSS have right-skewed (positively skewed) distributions.
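Skew direction can also be checked numerically: pandas' skew() is negative for left-skewed data and positive for right-skewed. A tiny sketch with made-up numbers (with the real frame the equivalent call is new_MTB[cols].skew()):

```python
import pandas as pd

# Made-up values with a long left tail, standing in for a real column.
sample = pd.Series([10, 20, 25, 28, 30, 30, 31])

# skew() < 0 means left-skewed, > 0 right-skewed, ~0 roughly symmetric.
print(sample.skew() < 0)  # → True
```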
In [6]:
fig, axes = plt.subplots(1, 2) #two plots side by side

# Distribution Plot (a.k.a. Histogram)
sns.distplot(new_MTB['Average Power'],
             ax=axes[0])
sns.distplot(new_MTB['Average Speed'],
             ax=axes[1])
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x21849b1edd8>
In [7]:
fig, axes = plt.subplots(1, 3) #three plots side by side

# Distribution Plot (a.k.a. Histogram)
sns.distplot(new_MTB['Distance'],
             ax=axes[0])
sns.distplot(new_MTB['Duration'],
             ax=axes[1])
sns.distplot(new_MTB['TSS'],
             ax=axes[2])
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x21849ce3908>
  3. Explore the relationships between the following variables. Are any of them correlated with each other (do they vary together in a predictable way)? Can you explain any relationships you observe?
    • Average Heart Rate, Average Power and Average Speed are correlated with each other.
    • Distance, Elevation Gain, Time Moving and TSS are correlated with each other.
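These pairwise relationships can be quantified with a Pearson correlation matrix; a small sketch on synthetic data (the real analysis would simply call new_MTB.corr()):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: speed is driven by power, the noise column is not.
rng = np.random.default_rng(0)
power = rng.normal(180, 30, 100)
speed = power * 0.1 + rng.normal(0, 1, 100)  # varies with power
noise = rng.normal(0, 1, 100)                # unrelated

df = pd.DataFrame({'Average Power': power, 'Average Speed': speed, 'noise': noise})

# Pearson correlations: values near 1 vary together, near 0 do not.
corr = df.corr()
print(corr.loc['Average Power', 'Average Speed'] > 0.9)  # → True
```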
In [8]:
%run scripts/PearsonValue.py
<Figure size 432x288 with 0 Axes>
In [9]:
g = sns.pairplot(new_MTB,
             palette = 'husl')

g.map_upper(corr)
plt.show()
In [10]:
sns.pairplot(new_MTB,
             vars = ['Average Power', 'Average Speed', 'Average Heart Rate'],
             hue = 'Workout Type',
             palette = 'husl')
Out[10]:
<seaborn.axisgrid.PairGrid at 0x2184cfd3eb8>
Average Heart Rate vs Average Power
  • The higher the Average Heart Rate, the more power the cyclist exerted.
  • The cyclist exerts more power in a Race compared to a normal Ride or Workout, which also results in a higher Average Heart Rate.
  • Data points with an Average Heart Rate or Average Power of zero may be caused by device or external errors.
Average Speed vs Average Power
  • The higher the power, the faster the bike will be. This is supported by the formula for power:
    $Power = Force * Velocity$
  • As in the previous plot (Average Heart Rate vs Average Power), the cyclist exerts more power in Races compared to normal Rides and Workouts. This also results in faster speeds in Races.
  • Data points with an Average Speed or Average Power of zero may be caused by device or external errors.
Average Heart Rate vs Average Speed
  • The faster the speed of the bike, the higher the Average Heart Rate will be.
  • As in the previous plots, the more power the cyclist exerts, the faster the bike goes and the higher the heart rate.
  • As before, the cyclist has a higher Average Heart Rate in a Race compared to a normal Ride or Workout.
  • Data points with an Average Heart Rate or Average Speed of zero may be caused by device or external errors.
Observations per Workout Type
  • As seen in the histogram plots, the values for Workout are almost aligned with the Ride and Race values.
In [11]:
sns.pairplot(new_MTB,
             vars = ['Distance', 'Elevation Gain', 'Time Moving', 'TSS'],
             hue = 'Workout Type',
             palette = 'husl')
C:\Users\Alyssa Raphaella Lim\Anaconda3\lib\site-packages\statsmodels\nonparametric\kde.py:487: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X, a, b, gridsize) / (delta * nobs)
C:\Users\Alyssa Raphaella Lim\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2
Out[11]:
<seaborn.axisgrid.PairGrid at 0x2184de7dc50>
Distance vs Time Moving
  • The longer the distance, the higher the time moving, which is supported by the formula:
    $Distance = Speed * Time$
  • We can say that distance and moving time are dependent on each other.
  • Ride has a longer distance range compared to the other two categories, which is why its moving time is also higher.
  • As seen in the graph, the Race points have lower values of Time Moving compared to the other two categories. This is because, for a fixed distance, speed and time are inversely proportional to each other.
Distance vs TSS
  • The longer the distance, the higher the TSS.
    $TSS = (Intensity Factor)^2 * Ride Time (hours) * 100$
  • Moving time, which we can say depends on distance, affects TSS.
Elevation Gain vs Distance
  • For the Ride and Race categories, the longer the distance travelled by the cyclist, the higher the elevation gain.
  • Some points have a higher Elevation Gain even though the horizontal distance is the same; those rides cover more vertical distance.
  • The horizontal distance affects the gradient the cyclist climbs:
    $Gradient = \frac{Vertical Distance}{Horizontal Distance} * 100$
  • The Workout points have zero elevation gain because the cyclist is riding in a stationary position.
Elevation Gain vs Time Moving
  • For the Race and Ride categories, the higher the Elevation Gain, the higher the time moving.
  • This is because climbing against gravity slows the cyclist down, which leads to a higher time moving.
    $Time = \frac{Distance}{Average Speed}$
  • The Workout points have zero elevation gain because the cyclist is riding in a stationary position.
Elevation Gain vs TSS
  • For the Race and Ride categories, the higher the Elevation Gain, the higher the TSS.
  • The Workout points have zero elevation gain because the cyclist is riding in a stationary position.
TSS vs Time Moving
  • The higher the TSS, the longer the Time Moving period.
    $TSS = (Intensity Factor)^2 * Ride Time (hours) * 100$
  • Since the Race category has the shortest Time Moving period, its TSS is also lower compared to the other two categories.
  • Time Moving is affected by the speed:
    $Time = \frac{Distance}{Average Speed}$
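The TSS formula quoted above can be written directly as a small helper:

```python
# TSS as quoted above: squared Intensity Factor times ride hours times 100.
def training_stress_score(intensity_factor, ride_hours):
    return intensity_factor ** 2 * ride_hours * 100

# e.g. a 2-hour ride at IF 0.85
print(round(training_stress_score(0.85, 2.0), 1))  # → 144.5
```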
  4. We want to explore the differences between the three categories: Race, Workout and Ride.
    • Use scatter plots with different colours for each category to explore how these categories differ.
    • Use histograms or box plots to visualise the different distributions of a variable for the three categories.
    • In both cases, experiment with different variables but only include those that are interesting in your final notebook (if none are interesting, show us a representative example).
In [12]:
# Melt DataFrame
melt_MTB = pd.melt(new_MTB, 
                    id_vars=['Workout Type'],
                    var_name="Vars")
In [13]:
plt.figure(figsize=(10,6)) #plot size

#box plot
sns.boxplot(x = 'Vars',
            y = 'value',
            data = melt_MTB,
            hue = 'Workout Type',
            palette = 'husl') 

plt.title('Workout Category')
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.xticks(rotation=-10)
Out[13]:
(array([0, 1, 2, 3, 4, 5, 6, 7]), <a list of 8 Text xticklabel objects>)
In [14]:
plt.figure(figsize=(10,6)) 
 
sns.swarmplot(x = 'Vars', 
              y = 'value', 
              data = melt_MTB, 
              hue = 'Workout Type', 
              split = True,
              palette = 'husl')
 
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.xticks(rotation=-10)
C:\Users\Alyssa Raphaella Lim\Anaconda3\lib\site-packages\seaborn\categorical.py:2974: UserWarning: The `split` parameter has been renamed to `dodge`.
  warnings.warn(msg, UserWarning)
Out[14]:
(array([0, 1, 2, 3, 4, 5, 6, 7]), <a list of 8 Text xticklabel objects>)
  • Duration and Time Moving have equal values for the Race and Workout categories.
  • Ride, on the other hand, has a longer Duration compared to Time Moving, which means that on a normal ride the cyclist takes breaks, unlike the other two categories where the ride is continuous (no stops).
  • Looking at the three graphs we can observe the relationship between Average Speed, Duration, and Time Moving: the lower the Average Speed, the higher the Duration and Time Moving.
  • Though Ride's Elevation Gain is significantly higher than the others, its Average Heart Rate is smaller compared to Race and Workout. This is because the Average Speed and Average Power of Race and Workout are higher.
  • Though Ride's Time Moving is significantly higher than the other two categories, its distance differs only slightly from theirs. This is because its Average Speed is lower.
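One caveat of melting all variables into a single box/swarm plot is that the columns live on very different scales; min-max scaling them first makes the boxes comparable. A sketch with a hypothetical frame standing in for new_MTB:

```python
import pandas as pd

# Hypothetical frame: columns on very different scales, as in new_MTB.
df = pd.DataFrame({'Distance': [10.0, 50.0, 90.0],
                   'TSS': [20.0, 100.0, 180.0],
                   'Workout Type': ['Ride', 'Race', 'Ride']})

num_cols = ['Distance', 'TSS']
# Min-max scale each numeric column to [0, 1] so one shared axis fits all.
scaled = df.copy()
scaled[num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())

melted = pd.melt(scaled, id_vars=['Workout Type'], var_name='Vars')
print(melted['value'].min(), melted['value'].max())  # → 0.0 1.0
```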

Challenge

  • What leads to more kudos? Is there anything to indicate which rides are more popular? Explore the relationship between the main variables and kudos. Show a plot and comment on any relationship you observe.
In [15]:
plt.figure(figsize=(10,6)) 
 
sns.barplot(x = 'kudos', 
            y = 'distance', 
            data = not_MTB, 
            hue = 'workout_type', 
            palette = 'husl',
            ci = None)
 
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.xticks(rotation=-10)
Out[15]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22]), <a list of 23 Text xticklabel objects>)
  • Longer rides receive more kudos.
  • The Race category receives more kudos compared to the other two, even though its average distance is shorter.
In [16]:
plt.figure(figsize=(10,6)) 
 
sns.barplot(x = 'kudos', 
            y = 'Average Speed', 
            data = not_MTB, 
            hue = 'workout_type', 
            palette = 'husl',
            ci = None)
 
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.xticks(rotation=-10)
Out[16]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22]), <a list of 23 Text xticklabel objects>)
  • The Race category again stands out on kudos, but Average Speed doesn't seem to affect the number of kudos received.
  • The Average Speed across the different categories is fairly consistent, yet the number of kudos varies.
In [17]:
plt.figure(figsize=(10,6)) 
 
sns.barplot(x = 'kudos', 
            y = 'TSS', 
            data = not_MTB, 
            hue = 'workout_type', 
            palette = 'husl',
            ci = None)
 
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.xticks(rotation=-10)
Out[17]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22]), <a list of 23 Text xticklabel objects>)
  • The average TSS varies with the number of kudos.
  • A higher TSS per category doesn't guarantee more kudos.
  • Some rides have a high TSS but fewer kudos, and others have a lower TSS but more kudos.
Observations:
  • The category of the ride matters for getting more kudos: Races receive the most kudos, compared to Rides and Workouts.
  • The longer the distance, the more kudos the rider receives.
  • As seen in the second and third graphs, Average Speed doesn't really count towards receiving more kudos, and the same goes for average TSS.

Challenge

  • Generate a plot that summarises the number of km ridden each month over the period of the data. Overlay this with the sum of the Training Stress Score and the average of the Average Speed to generate an overall summary of activity.
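The aggregation behind this summary is done in the external script (scripts/CBtwo), but the same monthly roll-up can be sketched with resample (hypothetical rides, not the real data):

```python
import pandas as pd

# Hypothetical ride log with a datetime index, standing in for not_MTB.
idx = pd.to_datetime(['2018-01-05', '2018-01-20', '2018-02-03'])
rides = pd.DataFrame({'distance': [30.0, 16.2, 40.0],
                      'TSS': [60.0, 54.8, 80.0],
                      'Average Speed': [25.0, 27.0, 24.0]}, index=idx)

# Per month: total km, total TSS, mean of the Average Speed.
monthly = rides.resample('M').agg({'distance': 'sum',
                                   'TSS': 'sum',
                                   'Average Speed': 'mean'})
print(len(monthly), round(monthly['distance'].iloc[0], 1))  # → 2 46.2
```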
In [18]:
%run scripts/CBtwo

#Display DataFrame
df_DT=pd.DataFrame.from_records(Summary, MY)
df_DT=df_DT.dropna()
df_DT.columns=['Total Distance', 'Total TSS', 'Average Speed']
df_DT
Out[18]:
Total Distance Total TSS Average Speed
Jan2018 46.2 114.8044 30.230700
Jan2019 423.2 1057.9363 15.231867
Feb2018 476.8 1087.2924 23.471867
Feb2019 494.9 1269.9272 23.938965
Mar2018 508.0 1381.0867 23.345871
Mar2019 552.0 1611.9748 26.884221
Apr2018 450.2 1324.5363 22.974325
Apr2019 640.4 1609.1064 23.602786
May2018 339.5 718.8654 21.611460
May2019 610.1 1591.1052 26.151856
Jun2018 193.4 586.4858 27.037100
Jun2019 522.6 1467.1178 26.003711
Jul2018 190.0 381.4320 22.065450
Jul2019 367.2 1096.1419 26.565400
Aug2018 214.9 370.5251 17.990422
Sep2018 204.9 627.2077 23.970480
Oct2018 440.7 1023.1827 24.537392
Nov2018 692.6 1793.3806 25.171586
Dec2018 585.1 1498.1566 26.394479
<Figure size 432x288 with 0 Axes>
In [19]:
sns.pairplot(df_DT)
Out[19]:
<seaborn.axisgrid.PairGrid at 0x2184f838860>

Challenge

  • Generate a similar graph but one that shows the activity over a given month, with the sum of the values for each day of the month shown. So, if there are two rides on a given day, the graph should show the sum of the distances etc for these rides.
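Summing multiple rides on the same day is again a resample, this time at daily frequency; a sketch with hypothetical rides:

```python
import pandas as pd

# Two rides on the same day plus one the next day (hypothetical values).
idx = pd.to_datetime(['2019-05-01 07:00', '2019-05-01 17:30', '2019-05-02 08:00'])
rides = pd.DataFrame({'distance': [20.0, 15.0, 40.0]}, index=idx)

# Daily totals: rides falling on the same day are summed together.
daily = rides.resample('D').sum()
print(daily['distance'].tolist())  # → [35.0, 40.0]
```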
In [20]:
func(5,2019)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x21850d454e0>

Portfolio 2 - Energy Consumption Data

In [1]:
%matplotlib inline
%run scripts/Portfolio2_util.py
import warnings; warnings.simplefilter('ignore')
In [2]:
energydata.head()
Out[2]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... Visibility Tdewpoint rv1 rv2 Month Day Hour Week NSM Week Status
date
2016-01-11 17:00:00+00:00 60 30 19.89 47.596667 19.2 44.790000 19.79 44.730000 19.000000 45.566667 ... 63.000000 5.3 13.275433 13.275433 Jan Monday 17 2 17.000000 0
2016-01-11 17:10:00+00:00 60 30 19.89 46.693333 19.2 44.722500 19.79 44.790000 19.000000 45.992500 ... 59.166667 5.2 18.606195 18.606195 Jan Monday 17 2 17.166667 0
2016-01-11 17:20:00+00:00 50 30 19.89 46.300000 19.2 44.626667 19.79 44.933333 18.926667 45.890000 ... 55.333333 5.1 28.642668 28.642668 Jan Monday 17 2 17.333333 0
2016-01-11 17:30:00+00:00 50 40 19.89 46.066667 19.2 44.590000 19.79 45.000000 18.890000 45.723333 ... 51.500000 5.0 45.410389 45.410389 Jan Monday 17 2 17.500000 0
2016-01-11 17:40:00+00:00 60 40 19.89 46.333333 19.2 44.530000 19.79 45.000000 18.890000 45.530000 ... 47.666667 4.9 10.084097 10.084097 Jan Monday 17 2 17.666667 0

5 rows × 34 columns

Energy Consumption of Household per month from January to May
  • no specific pattern observed
In [3]:
energydata.plot.line(x = 'Month', 
                     y = 'Appliances',
                     figsize=(10,6),
                     color = '#2E86C1',
                     title='Appliances Consumption Per Month',
                     legend = False)
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x2603d697128>
Energy Consumption of Household in the first week
  • no specific pattern observed
In [4]:
energydata['Date'] = [d.date() for d in energydata.index]
In [5]:
def func(start, end):
    range_date = energydata[start:end]
    return range_date.plot.line(y = 'Appliances',
                                figsize=(10,6), 
                                color = '#2E86C1', 
                                title = 'Frequency of Energy Consumption', 
                                legend = False)

func('2016-01-13', '2016-01-19')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x2603db2add8>
Frequency of Household energy consumption from the Appliance column of the data
In [6]:
plt.figure(figsize=(10,6))

sns.distplot(energydata['Appliances'],
             kde = False,
             bins = 80,
             color = '#2E86C1').set_title('Frequency of Energy Consumption')
Out[6]:
Text(0.5, 1.0, 'Frequency of Energy Consumption')
In [7]:
plt.figure(figsize=(10,6))

sns.boxplot(x=energydata['Appliances'], 
            color = '#2E86C1').set_title('Frequency of Energy Consumption')
Out[7]:
Text(0.5, 1.0, 'Frequency of Energy Consumption')
In [8]:
%run scripts/PearsonValue.py
<Figure size 432x288 with 0 Axes>

Training Set : Pair Plot

  • Appliances
  • Lights
  • T1 and RH1 : Temperature and Humidity Kitchen
  • T2 and RH2 : Temperature and Humidity Living Room
  • T3 and RH3 : Temperature and Humidity Laundry Room
In [9]:
g = sns.pairplot(energydata,
             vars = ['Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3'],
             palette = 'husl',
             diag_kind = 'kde',
             diag_kws=dict(shade=True))

g.map_upper(corr)
g.map_lower(sns.regplot, line_kws = {'color':'red'})
plt.show()

Training Set : Pair Plot

  • Appliances
  • T4 and RH4 : Temperature and Humidity Office
  • T5 and RH5 : Temperature and Humidity Bathroom
  • T6 and RH6 : Temperature and Humidity Outdoor
In [10]:
g = sns.pairplot(energydata,
             vars = ['Appliances', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6'],
             palette = 'husl',
             diag_kind = 'kde',
             diag_kws=dict(shade=True))

g.map_upper(corr)
g.map_lower(sns.regplot, line_kws = {'color':'red'})
plt.show()

Training Set : Pair Plot

  • Appliances
  • T_out : Temperature weather station
  • Press_mm_hg : Pressure from weather station
  • RH_out : Humidity weather station
  • Windspeed from weather station
  • Visibility from weather station
  • Tdewpoint from weather station
  • NSM : Number of seconds from midnight
  • T6 : Temperature outside the building
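NSM is derived from the timestamp; a minimal sketch of the computation, assuming a pandas DatetimeIndex like the one above:

```python
import pandas as pd

ts = pd.to_datetime(['2016-01-11 17:00:00', '2016-01-11 00:00:30'])

# NSM: seconds elapsed since midnight of the same day.
nsm = (ts - ts.normalize()).total_seconds()
print(list(nsm))  # → [61200.0, 30.0]
```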
In [11]:
g = sns.pairplot(energydata,
             vars = ['Appliances', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint', 'NSM','T6'],
             palette = 'husl',
             diag_kind = 'kde',
             diag_kws=dict(shade=True))

g.map_upper(corr)
g.map_lower(sns.regplot, line_kws = {'color':'red'})
plt.show()
Hourly energy consumption of appliances heat map for four consecutive weeks
  • no specific pattern observed in daily energy consumption
In [12]:
heatmap_func(2,3,4,5)

RFE & Linear Regression : Normalized Data

  • Chose 9 features from the dataset columns
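The feature selection itself happens in an external script, but the general shape of RFE with a linear model can be sketched as follows (synthetic data; the real run uses the normalised energy features):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the normalised energy features (30 candidates).
X, y = make_regression(n_samples=200, n_features=30, n_informative=9,
                       random_state=0)

# RFE repeatedly fits the model and drops the weakest feature until the
# requested number (9, as in the table below) remains.
selector = RFE(LinearRegression(), n_features_to_select=9).fit(X, y)
print(int(selector.support_.sum()))  # → 9
```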
In [13]:
RFE_df
Out[13]:
Feature Coefficient
0 lights 0.170718
1 RH_1 0.633667
2 T2 -0.476669
3 RH_2 -0.687333
4 T3 0.484559
5 T4 -0.123451
6 T6 0.326372
7 T_out -0.306186
8 NSM 0.151052
Tally of Mean Squared Error and R-Squared per Model
  • Mean Squared Error: the lower the error value, the better the model
  • R-Squared: the higher the value (the closer to 1), the better the model
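These metrics can be computed with scikit-learn's standard helpers; a small self-contained sketch with toy predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE and MAE: lower is better; R^2: higher is better, 1.0 is a perfect fit.
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, mae)  # → 0.375 0.5
```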
In [14]:
stats_df
Out[14]:
MSE_Train MSE_Test R2_Train R2_Test MAE_Train MAE_Test MAPE_Train MAPE_Test
Model
Linear Regression 0.839834 0.856303 0.160109 0.143523 0.525953 0.516919 192.085243 158.970274
Random Forest 0.072051 0.500759 0.927944 0.499140 0.123854 0.335718 65.755171 140.555058
Gradient Boosting 0.013635 0.520382 0.986364 0.479512 0.073066 0.332712 38.083462 132.665987
SVR 0.732529 0.783631 0.267422 0.216210 0.350883 0.362522 92.145475 85.014203

Cross Validation per Model, RMSE and R-Squared

  • As seen in the R-Squared and RMSE values, SVR is the best model used in this activity, while Random Forest and Gradient Boosting perform the worst.
In [15]:
model_list = [linear_model.LinearRegression(), 
              RandomForestRegressor(n_estimators=20, random_state=0), 
              GradientBoostingRegressor(n_estimators=20, learning_rate=0.25, max_depth=13, random_state=0),
              SVR(kernel = "rbf")]
model_name = ['Linear Regression', 'Random Forest', 'Gradient Boosting', 'SVR']

X = energydata[all_cols]
y = energydata['Appliances']
cv_rec = []

for i in range(4):
    model = model_list[i]
    r2_cv = cross_val_score(model, X, y, cv=5, scoring = 'r2').mean()
    # neg_mean_squared_error is the negative MSE, so negate it and take the
    # square root to report a true RMSE
    rmse_cv = np.sqrt(-cross_val_score(model, X, y, cv=5, scoring = 'neg_mean_squared_error').mean())
    cv_rec.append((model_name[i], r2_cv, rmse_cv))
    
cv_df = pd.DataFrame(cv_rec)
cv_df.columns = ['Model', 'R-Squared', 'RMSE']
cv_df.index = cv_df['Model']
cv_df = cv_df.drop('Model',
                axis = 1)
In [16]:
plt.figure(figsize=(5,3))
cv_df['R-Squared'].plot(kind = 'barh', 
           title='R-Squared', 
           legend = False)

plt.figure(figsize=(5,3))
cv_df['RMSE'].plot(kind = 'barh', 
           title='RMSE', 
           legend = False)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x26051c9d630>
Feature Importance per Model

As seen from the graphs below, both models rank NSM as the most important variable, but the rest of the variables differ in importance per model.

In [17]:
# type_of_target returns a string describing the target ('multiclass',
# 'continuous', ...); used here to confirm the int-cast targets are discrete
X_train_rf = utils.multiclass.type_of_target(X_train.astype('int'))
y_train_rf = utils.multiclass.type_of_target(y_train.astype('int'))
In [18]:
forest = ExtraTreesClassifier(n_estimators=20,
                              random_state=0)
X = energydata[all_cols]
y = energydata['Appliances']

forest.fit(X, y)
importances = forest.feature_importances_
RF_data = {'Features': all_cols, 'Importance': importances}
RF_df = pd.DataFrame(RF_data)
RF_df = RF_df.sort_values(by = ['Importance'])

plt.figure(figsize=(10,6))
RF_df.plot(kind = 'barh', 
           x = 'Features', 
           y = 'Importance', 
           title='RF Feature Importance', 
           legend = False)
plt.xlabel('Importance')
Out[18]:
Text(0.5, 0, 'Importance')
<Figure size 720x432 with 0 Axes>
In [19]:
from xgboost import XGBClassifier
from xgboost import plot_importance

model = XGBClassifier()
model.fit(X, y)

plot_importance(model)
plt.figure(figsize=(10,6))
Out[19]:
<Figure size 720x432 with 0 Axes>
<Figure size 720x432 with 0 Axes>
In [ ]:
 

Portfolio 3 - Clustering Visualisation

K-means clustering is one of the simplest and most popular unsupervised learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known or labelled outcomes. This notebook illustrates the process of K-means clustering by generating some random clusters of data and then showing the iterations of the algorithm as random cluster means are updated.

We first generate random data around 4 centers.

In [1]:
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt
%matplotlib inline

import copy
In [2]:
center_1 = np.array([1,2])
center_2 = np.array([6,6])
center_3 = np.array([9,1])
center_4 = np.array([-5,-1])

# Generate random data and center it to the four centers each with a different variance
np.random.seed(5)
data_1 = np.random.randn(200,2) * 1.5 + center_1
data_2 = np.random.randn(200,2) * 1 + center_2
data_3 = np.random.randn(200,2) * 0.5 + center_3
data_4 = np.random.randn(200,2) * 0.8 + center_4

data = np.concatenate((data_1, data_2, data_3, data_4), axis = 0)

plt.scatter(data[:,0], data[:,1], s=7, c='k')
plt.show()

1. Generate random cluster centres

You need to generate four random centres.

This part of the portfolio should contain at least:

  • The number of clusters k is set to 4;
  • Generate random centres via centres = np.random.randn(k,c)*std + mean where std and mean are the standard deviation and mean of the data. c represents the number of features in the data. Set the random seed to 6.
  • Color the generated centers with green, blue, yellow, and cyan. Set the edgecolors to red.
In [3]:
std = {1: 1.25, 2: 1.75, 3: 1.5, 4: 2}

k = 4
np.random.seed(6)
mean = {i+1: [np.random.randint(1,20), np.random.randint(1,20)]
          for i in range(k)}

centres = {i+1: np.random.randn(100,2) * std[i+1] + mean[i+1]
          for i in range(k)} 
df_centres = pd.DataFrame()
for i in centres.keys():
    df_dump = pd.DataFrame(centres[i])
    df_centres = df_centres.append(df_dump, ignore_index=True)
df_centres = df_centres.rename(columns = {0: 'x', 1: 'y'})

colmap = {1: 'g', 2: 'b', 3: 'y', 4: 'c'}
for i in mean.keys():
    plt.scatter(centres[i][:,0], centres[i][:,1], color = 'k', alpha = 0.2)
    plt.scatter(*mean[i], color = colmap[i], edgecolors = 'r')
plt.show()

2. Visualise the clustering results in each iteration

You need to implement the process of k-means clustering. Implement each iteration as a separate cell: assign each data point to the closest centre, then update the cluster centres based on the data, then plot the new clusters.

  • Cluster the points according to the distance from the assigned centroids.
In [4]:
def clustering(df_centres, mean):
    # distance from every point to each cluster centre (vectorised, so no
    # per-row loop is needed)
    for i in mean.keys():
        df_centres['distance_from_{}'.format(i)] = (np.sqrt((df_centres['x'] - mean[i][0])**2 + 
                                                    (df_centres['y'] - mean[i][1])**2))
    
    mean_dist = ['distance_from_{}'.format(i)
                   for i in mean.keys()]
    df_centres['closest'] = df_centres.loc[:, mean_dist].idxmin(axis=1) # column name of the minimum distance
    df_centres['closest'] = df_centres['closest'].map(lambda x: int(x.split('_')[-1])) # extract the cluster number from the column name
    df_centres['color'] = df_centres['closest'].map(lambda x: colmap[x])
    return df_centres

mean_old = copy.deepcopy(mean)
def update(mean):
    for i in mean.keys():
        mean[i][0] = np.mean(df_centres[df_centres['closest'] == i]['x'])
        mean[i][1] = np.mean(df_centres[df_centres['closest'] == i]['y'])
    return mean
In [5]:
df_centres = clustering(df_centres, mean)

plt.scatter(df_centres['x'], df_centres['y'], color = df_centres['color'], alpha = 0.2)
for i in centres.keys():
    plt.scatter(*mean[i], color = colmap[i], edgecolors = 'r')
plt.show()
  • Iterate the point assignments and centroid updates to refine the centroid positions. Stabilisation of the centroids means that the clustering has converged.
In [6]:
df_mean = pd.DataFrame()

x_diff = 1
y_diff = 1
while x_diff != 0 or y_diff != 0: # keep iterating until the centroids stop moving
    mean_old = copy.deepcopy(mean)
    mean = update(mean)
    
    mean_dp = pd.DataFrame(mean)
    mean_dp = mean_dp.T 
    mean_dp = mean_dp.rename(columns = {0: 'x', 1: 'y'})
    mean_dp['xy'] = mean_dp.apply(lambda x: [x['x'], x['y']], axis=1)
    mean_dp = mean_dp.drop(['x', 'y'], axis=1)
    mean_dp = mean_dp.T
    df_mean = df_mean.append(mean_dp, ignore_index = True)
        
    df_centres = clustering(df_centres, mean)
    
    # track the largest centroid movement across all clusters, so convergence
    # is only declared once every centroid has stopped moving
    x_diff = max(abs(mean[i][0] - mean_old[i][0]) for i in mean.keys())
    y_diff = max(abs(mean[i][1] - mean_old[i][1]) for i in mean.keys())
     
plt.scatter(df_centres['x'], df_centres['y'], color = df_centres['color'], alpha = 0.2)
for i in centres.keys():
    plt.scatter(*mean[i], color = colmap[i], edgecolors = 'r')
plt.show()
In [7]:
df_mean = df_mean.rename(columns = {1: 'Cluster 1', 2: 'Cluster 2', 3: 'Cluster 3', 4: 'Cluster 4'})
df_mean
Out[7]:
Cluster 1 Cluster 2 Cluster 3 Cluster 4
0 [11.135498526776935, 10.080694829885076] [3.9731241546127456, 11.068024616697656] [14.27152288202329, 15.866046526800409] [10.52193353734501, 17.23284789565484]
1 [11.135498526776935, 10.080694829885076] [3.9731241546127456, 11.068024616697656] [14.217839961551302, 15.870001267035816] [10.45959685947391, 17.27638924327624]
2 [11.135498526776935, 10.080694829885076] [3.9731241546127456, 11.068024616697656] [14.217839961551302, 15.870001267035816] [10.45959685947391, 17.27638924327624]
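As a cross-check on the manual loop above, the same data can be clustered with scikit-learn's KMeans (an assumption: scikit-learn is available, as it is not otherwise used in this notebook):

```python
import numpy as np
from sklearn.cluster import KMeans

# Regenerate the same four blobs as at the top of this notebook.
np.random.seed(5)
data = np.concatenate([np.random.randn(200, 2) * 1.5 + [1, 2],
                       np.random.randn(200, 2) * 1.0 + [6, 6],
                       np.random.randn(200, 2) * 0.5 + [9, 1],
                       np.random.randn(200, 2) * 0.8 + [-5, -1]])

# scikit-learn runs the same assign/update loop internally (n_init restarts,
# stopping when the centres move less than a tolerance).
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(data)
print(km.cluster_centers_.shape)  # → (4, 2)
```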
In [ ]: